Data Camp Project

Prediction of the success rate of the French "brevet des collèges"

Authors: Allain Cédric - Gerbeaux Alexis - Miura Clotilde - Muliukov Artem

Import

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

Introduction

Every year in France, just over 810,000 students in the fourth year of collège ("classe de troisième"), aged around 15, take the brevet des collèges, a diploma that completes the first cycle of secondary education. Since schooling is compulsory up to the age of 16 in France, all teenagers take the brevet exams, making it a universal measure of the level of French students of that age. In 2019, 86.5% of the 813,200 candidates obtained the national brevet diploma. Here is the evolution of the success rate in recent years.

In [2]:
tx = pd.read_csv("data/college/tx_succes.csv", sep=";", decimal=',')
plt.figure(figsize=(10,7))
plt.plot(tx.annee, tx.taux, '-o', color='orange')
plt.xlabel("Year")
plt.ylabel("Success rate (%)")
plt.title("Evolution of the national success rate in the 'brevet des collèges'", fontsize=14)
plt.ylim(50,100)
plt.show()
# Source : https://www.data.gouv.fr/fr/datasets/le-diplome-national-du-brevet-00000000/#_

However, this rate is not the same for everyone. There are two streams in the examination: the general stream, in which 90% of the students are enrolled, and the vocational stream, the so-called "filière professionnelle". For 2019, the pass rate was 87.8% in the general stream compared to 77.2% in the vocational stream. There is a similarly large gap between girls and boys, with a pass rate of 90% for girls compared to 83% for boys. Similarly, there are major inequalities among graduating students. Here is the distribution of the types of honors, the "mentions", at the latest session of the brevet.

In [3]:
mentions = pd.read_csv("data/college/tx_mentions.csv", sep=";", decimal=',')
plt.figure(figsize=(10,7))
plt.pie(mentions.taux,
        labels=mentions.mention,
        colors=['darkgreen', 'green', 'lightgreen', 'lightyellow', 'red'],
        autopct='%1.1f%%')
plt.title("Breakdown of students in the 2019 'brevet' according to their mention")
plt.axis('equal')
plt.show()

Education is a major issue and is a central element of the policy of a country like France. France spends more than 150 billion euros a year on education. Around 20% of this sum, 30 billion euros, is allocated to training at the first cycle of secondary education, in other words, at the collège.

In 2017, the French national education system identified 7,200 public and private collèges. 79% of secondary school students were enrolled in a public school.

Social impact

First of all, predicting the brevet pass rate per collège is important because it makes it possible to capture spatial inequalities across the territory and to answer questions such as whether there is a divide between urban and rural collèges. Above all, predicting the success rate makes it possible to identify the factors that determine the success of pupils in classe de troisième in the brevet des collèges and, more generally, the determinants of educational success. This would enable those in charge of education policy to (re)allocate human and financial resources where these essential success factors are weaker than elsewhere, helping the national education system achieve its pedagogical objectives in secondary education. For example, small classes could be set up in a targeted manner outside the REP/REP+ zones (a kind of priority education network).

This subject, through the evaluation of the brevet success rate, aims to identify the criteria for success in schools with high success rates so that investment can be made in collèges where students are on average less successful. The social impact of this system of allocating teaching and educational resources could help to move closer to the republican principle of equal opportunities nationwide.

Performance indicators for public policy action

If this resource allocation program works well, we can expect the lowest-performing collèges to catch up. In other words, collèges that are currently below the national average, or have been repeatedly year after year, should close the gap with it.

To capture some of this "catch-up" effect, one can first look at the evolution of the variance of the distribution of brevet success rates. Furthermore, it is also possible to look at the evolution between two dates of the gap between the national average and the average of the group located below it: $$ \frac{\bar{y}^{1} - \bar{y}_{low}^{1}}{\bar{y}^{0} - \bar{y}_{low}^{0}}$$ where $\bar{y}^0$ is the national average success rate at time 0, and $\bar{y}_{low}^{0}$ is the average success rate among all schools that are in the first quartile of the distribution (the 25% of schools that have the lowest success rates), and similarly at time 1.

Thus, we expect this metric to be as small as possible, meaning that the bottom of the distribution is now more concentrated around the national average. Conversely, a value close to 1 would mean that nothing has changed between the two dates: even if the average of the bottom quartile has increased, this may only reflect a general upward trend.

Another possible indicator would be to do the same thing but with a fixed group of collèges. For example, at date 0 we identify a group of collèges in difficulty (i.e., with a very low success rate compared to other collèges), and it is within this group that we calculate the low average $\bar{y}_{low}^{0}$. At date 1, the low average $\bar{y}_{low}^{1}$ is calculated not on a new group of collèges but on the same group as at date 0. The metric is thus more accurate in the sense that it tracks the specific evolution of a targeted group of collèges.

However, one must be aware that such a simplistic metric as this one does not make it possible to assess the effectiveness of a public policy as a whole, because on the one hand it takes only two dates, while on the other hand it is necessary to look over several years at the effects of a public policy. Moreover, in order to identify the real effects of a policy, it is necessary to evaluate a more advanced econometric model.
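The catch-up indicator described above can be sketched as a small helper. This is a minimal illustration of the formula, not code from the project; the function name and the quantile-based definition of the "low" group are ours:

```python
import numpy as np

def catch_up_ratio(rates_t0, rates_t1, q=0.25):
    """Ratio of the gap (national mean - bottom-quartile mean) at time 1
    over the same gap at time 0; values well below 1 indicate catch-up,
    a value close to 1 indicates no change relative to the trend."""
    rates_t0 = np.asarray(rates_t0, dtype=float)
    rates_t1 = np.asarray(rates_t1, dtype=float)
    # mean success rate of the bottom q-fraction of schools at each date
    low0 = rates_t0[rates_t0 <= np.quantile(rates_t0, q)].mean()
    low1 = rates_t1[rates_t1 <= np.quantile(rates_t1, q)].mean()
    return (rates_t1.mean() - low1) / (rates_t0.mean() - low0)
```

For an unchanged distribution the ratio is exactly 1, while a distribution that tightens around its mean between the two dates yields a ratio below 1.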

Definition of the problem: Predict the success rate per collège

  • Our target : The success rate per collège in 2017.

In order to achieve our goal, it seems crucial to predict the brevet pass rate as accurately as possible. For this, we can use the "root mean square error", defined as follows: $$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$

In fact we will use the normalized RMSE, which is the RMSE divided by the standard deviation of the target variable $\sigma_y$: $$NRMSE = \frac{\sqrt{\frac{1}{n} \sum_{i=1}^{n}(y_i-\hat{y}_i)^2}}{\sigma_y}$$ The standard deviation of the target variable can be seen as the RMSE of a model that always predicts the average value. Thus, dividing the classic RMSE by $\sigma_y$ gives us a ratio which allows us to easily compare the performance of our model against this baseline.
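The normalized RMSE can be written as a short helper (a minimal sketch; the name `nrmse` is illustrative):

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Normalized RMSE: RMSE divided by the standard deviation of y_true.
    A model that always predicts the mean of y_true scores exactly 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / y_true.std()
```

By construction, a perfect prediction gives 0 and the constant-mean baseline gives 1, so any useful model should score strictly below 1.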

Data exploration

Import

In [4]:
import geopandas as gpd
import folium

pd.set_option('display.max_columns', 500)
pd.set_option("display.max_rows", 500)

import warnings
warnings.filterwarnings("ignore")

import sys
sys.path.insert(0, './tools')

Load data

First of all, data related to the collèges and their exam pass rates are loaded.

In [5]:
data_college = pd.read_csv('./data/college/data_college_filtered.csv', index_col=0)
print('shape of the college table:', data_college.shape)
data_college.head()
shape of the college table: (4186, 44)
Out[5]:
Appartenance EP Name Coordonnée X Coordonnée Y Etablissement sensible CATAEU2010 Situation relative à une zone rurale ou autre Commune code City_name Commune et arrondissement code Commune et arrondissement nom Département code Département nom Académie code Académie nom Région code Région nom Région 2016 code Région 2016 nom Nb élèves Nb 6èmes Nb 5èmes Nb 4èmes générales Nb 3èmes générales Nb 6ème SEGPA Nb 5ème SEGPA Nb 4ème SEGPA Nb 3ème SEGPA Nb SEGPA Nb 3èmes générales retardataires Nb divisions Nb 6èmes provenant d'une école EP Nb 5èmes 4èmes et 3èmes générales Latin ou Grec Nb 5èmes 4èmes et 3èmes générales Nb élèves pratiquant langue rare Nb 6èmes bilangues Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales Nb 6èmes 5èmes 4èmes et 3èmes générales Nb 3émes générales et insertion rentrée précédente passés en 2nde GT Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel Longitude Latitude Position target
0 HEP jules simon 267972.0 6744464.8 NON 111.0 urbain 56260 vannes 56260 VANNES 56 MORBIHAN 14.0 RENNES 53 BRETAGNE 53 BRETAGNE 774.0 175.0 185.0 198.0 216.0 0.0 0.0 0.0 0.0 0.0 37.0 30.0 43.0 32.0 0.0 774.0 147.0 38.0 214.0 -2.760887 47.658668 -2.760887 47.658668 47.6586681136,-2.76088651324 95.6
1 HEP raoul blanchard 942287.5 6538643.7 NON 111.0 urbain 74010 annecy 74010 ANNECY 74 HAUTE SAVOIE 8.0 GRENOBLE 82 RHONE-ALPES 84 AUVERGNE-ET-RHONE-ALPES 829.0 211.0 200.0 192.0 213.0 0.0 0.0 0.0 0.0 0.0 51.0 31.0 144.0 0.0 0.0 816.0 150.0 41.0 211.0 6.125997 45.904308 6.125997 45.904308 45.9043076805,6.12599674762 88.9
2 HEP georges duhamel 637692.1 6878029.4 NON 111.0 urbain 95306 herblay 95306 HERBLAY 95 VAL-D'OISE 25.0 VERSAILLES 11 ILE-DE-FRANCE 11 ILE-DE-FRANCE 372.0 102.0 85.0 84.0 85.0 0.0 0.0 0.0 0.0 0.0 20.0 14.0 0.0 12.0 0.0 356.0 57.0 20.0 84.0 2.148468 48.999161 2.148468 48.999161 48.9991613692,2.14846829582 76.3
3 HEP edouard herriot 585352.4 6816567.4 NON 111.0 urbain 28218 luce 28218 LUCE 28 EURE-ET-LOIR 18.0 ORLEANS-TOURS 24 CENTRE-VAL-DE-LOIRE 24 CENTRE-VAL-DE-LOIRE 517.0 124.0 131.0 133.0 116.0 0.0 0.0 0.0 0.0 0.0 16.0 20.0 0.0 23.0 0.0 504.0 76.0 29.0 119.0 1.449799 48.439256 1.449799 48.439256 48.4392557302,1.44979875281 89.2
4 HEP henri dheurle 372116.5 6400997.6 NON 111.0 urbain 33529 la teste de buch 33529 LA TESTE-DE-BUCH 33 GIRONDE 4.0 BORDEAUX 72 AQUITAINE 75 NOUVELLE-AQUITAINE 777.0 194.0 194.0 174.0 201.0 0.0 0.0 0.0 0.0 0.0 24.0 27.0 1.0 29.0 0.0 763.0 111.0 40.0 189.0 -1.135637 44.630780 -1.135637 44.630780 44.6307801744,-1.13563658848 87.4

Separate the database into train and test sets

In [6]:
from sklearn.model_selection import train_test_split

y = data_college['target'].values
X = data_college.drop('target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

X_train['target'] = y_train
X_train.to_csv('./data/data_college_filtered_TRAIN.csv')

X_test['target'] = y_test
X_test.to_csv('./data/data_college_filtered_TEST.csv')
  • Only public collèges (middle schools) that presented at least 50 students to the exam are kept
  • 4,186 collèges remain after merging the table of success rates per collège with a table of explanatory variables per collège found on data.gouv
  • 44 features per collège
  • geographic data per city to enrich the data, available for 2010 and 2015; we will work with 2015. The table comes from INSEE and can be found at this link
  • The scraping code and the merge code can be found in the notebook 'data_camp_get_data'.

Let's have a look at the main variables that we have:

  • Appartenance EP: categorical, the level of priority education network
  • Name: name of the collège
  • Etablissement sensible: boolean, True if the collège is "sensitive"
  • Situation relative à une zone rurale ou autre: categorical, situation relating to a rural or other area
  • Département code: code for the department the collège is in
  • Académie code: code for the academy the collège is in
  • Région code: code for the region the collège is in
  • Nb 6èmes (resp. 5èmes, 4èmes générales, 3èmes générales): number of pupils in each level ("générales": general stream)

Important note: in this database there are no missing data, so we don't have to worry about this problem afterwards.

In [7]:
# number of unique values
data_college.nunique()
Out[7]:
Appartenance EP                                                                        3
Name                                                                                2355
Coordonnée X                                                                        4183
Coordonnée Y                                                                        4182
Etablissement sensible                                                                 2
CATAEU2010                                                                             8
Situation relative à une zone rurale ou autre                                          3
Commune code                                                                        2771
City_name                                                                           2748
Commune et arrondissement code                                                      2812
Commune et arrondissement nom                                                       2789
Département code                                                                     101
Département nom                                                                      101
Académie code                                                                         31
Académie nom                                                                          31
Région code                                                                           27
Région nom                                                                            27
Région 2016 code                                                                      18
Région 2016 nom                                                                       18
Nb élèves                                                                            733
Nb 6èmes                                                                             236
Nb 5èmes                                                                             223
Nb 4èmes générales                                                                   220
Nb 3èmes générales                                                                   218
Nb 6ème SEGPA                                                                         34
Nb 5ème SEGPA                                                                         36
Nb 4ème SEGPA                                                                         35
Nb 3ème SEGPA                                                                         36
Nb SEGPA                                                                             118
Nb 3èmes générales retardataires                                                      79
Nb divisions                                                                          46
Nb 6èmes provenant d'une école EP                                                    267
Nb 5èmes 4èmes et 3èmes générales Latin ou Grec                                      151
Nb 5èmes 4èmes et 3èmes générales                                                     98
Nb élèves pratiquant langue rare                                                     704
Nb 6èmes bilangues                                                                   180
Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales       93
Nb 6èmes 5èmes 4èmes et 3èmes générales                                              211
Nb 3émes générales et insertion rentrée précédente passés en 2nde GT                4184
Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel    4184
Longitude                                                                           4184
Latitude                                                                            4184
Position                                                                            4184
target                                                                               359
dtype: int64
In [8]:
# Python type of each feature
data_college.dtypes
Out[8]:
Appartenance EP                                                                      object
Name                                                                                 object
Coordonnée X                                                                        float64
Coordonnée Y                                                                        float64
Etablissement sensible                                                               object
CATAEU2010                                                                          float64
Situation relative à une zone rurale ou autre                                        object
Commune code                                                                         object
City_name                                                                            object
Commune et arrondissement code                                                       object
Commune et arrondissement nom                                                        object
Département code                                                                     object
Département nom                                                                      object
Académie code                                                                       float64
Académie nom                                                                         object
Région code                                                                           int64
Région nom                                                                           object
Région 2016 code                                                                      int64
Région 2016 nom                                                                      object
Nb élèves                                                                           float64
Nb 6èmes                                                                            float64
Nb 5èmes                                                                            float64
Nb 4èmes générales                                                                  float64
Nb 3èmes générales                                                                  float64
Nb 6ème SEGPA                                                                       float64
Nb 5ème SEGPA                                                                       float64
Nb 4ème SEGPA                                                                       float64
Nb 3ème SEGPA                                                                       float64
Nb SEGPA                                                                            float64
Nb 3èmes générales retardataires                                                    float64
Nb divisions                                                                        float64
Nb 6èmes provenant d'une école EP                                                   float64
Nb 5èmes 4èmes et 3èmes générales Latin ou Grec                                     float64
Nb 5èmes 4èmes et 3èmes générales                                                   float64
Nb élèves pratiquant langue rare                                                    float64
Nb 6èmes bilangues                                                                  float64
Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales     float64
Nb 6èmes 5èmes 4èmes et 3èmes générales                                             float64
Nb 3émes générales et insertion rentrée précédente passés en 2nde GT                float64
Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel    float64
Longitude                                                                           float64
Latitude                                                                            float64
Position                                                                             object
target                                                                              float64
dtype: object

In order to add socio-economic context to our analysis, it was decided to use a second database, this time with city-level information.

In [9]:
cities_data = pd.read_csv("./data/donnees_geographiques/cities_data_filtered.csv", index_col=0)
print('shape of the cities table', cities_data.shape)
cities_data.head()
shape of the cities table (36729, 9)
Out[9]:
insee_code LIBGEO REG DEP population SUPERF med_std_living poverty_rate unemployment_rate
0 01001 L'Abergement-Clémenciat 84 01 767.0 15.95 22228.000000 NaN 0.087766
1 01002 L'Abergement-de-Varey 84 01 241.0 9.15 22883.333333 NaN 0.081301
2 01004 Ambérieu-en-Bugey 84 01 14127.0 24.60 19735.200000 17.227132 0.158234
3 01005 Ambérieux-en-Dombes 84 01 1619.0 15.92 23182.666667 NaN 0.078759
4 01006 Ambléon 84 01 109.0 5.88 NaN NaN 0.137931

Let us briefly present the main variables of this second database.

  • insee_code: the Insee code of the city
  • LIBGEO: name of the city
  • REG: code for the région
  • DEP: code for the département
  • population: number of inhabitants in the city
  • poverty_rate: poverty rate in the city, in percentage
  • unemployment_rate: unemployment rate in the city, in percentage
In [10]:
# proportion of NaN values
cities_data.isna().sum() / cities_data.shape[0]
Out[10]:
insee_code           0.000000
LIBGEO               0.000000
REG                  0.000000
DEP                  0.000000
population           0.035068
SUPERF               0.035068
med_std_living       0.121974
poverty_rate         0.880476
unemployment_rate    0.035258
dtype: float64

The variables in this database are quite well filled in, except for the poverty rate, which is missing for more than 88% of cities. Subsequently, we will fill the missing data with the average value at the department level, so that no missing data remain.
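The department-level imputation announced here can be sketched as a small helper (illustrative names; the actual merge cell below applies the same groupby-transform idea to the real columns):

```python
import pandas as pd

def fill_with_group_mean(df, group_col, value_col):
    """Fill missing values of value_col with the mean of its group
    (here, the group would be the département)."""
    out = df.copy()
    out[value_col] = out.groupby(group_col)[value_col].transform(
        lambda s: s.fillna(s.mean()))
    return out
```

Note that a value stays missing only if the whole department has no observation for that variable, which is why this strategy removes virtually all gaps.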

In [11]:
# Python type of each feature
cities_data.dtypes
Out[11]:
insee_code            object
LIBGEO                object
REG                    int64
DEP                   object
population           float64
SUPERF               float64
med_std_living       float64
poverty_rate         float64
unemployment_rate    float64
dtype: object

Target visualization

As mentioned earlier, the objective is to arrive at an accurate estimate of each collège's pass rate on the brevet exam. The corresponding variable is named 'target'.

First, let's look at the simple distribution of these success rates.

In [12]:
plt.figure(figsize=(10,7))
plt.hist(data_college.target, bins=40)
plt.axvline(x=data_college.target.mean(), color="orange", label='mean')
plt.axvline(x=data_college.target.quantile(.5), color="red", label='median')
plt.xlabel("Success rate (%)")
plt.ylabel("Count")
plt.title("Distribution of the brevet success rates", fontsize=14)
plt.legend(loc='best')
plt.show()
In [13]:
data_college.target.describe()
Out[13]:
count    4186.000000
mean       87.429025
std         7.611328
min        41.300000
25%        82.925000
50%        88.600000
75%        93.100000
max       100.000000
Name: target, dtype: float64
In [14]:
print("The standard deviation of the success rates is %.2f" % data_college.target.std())
The standard deviation of the success rates is 7.61

Naive merge with the city database

In order to integrate the socio-economic data available in our second database, it is possible to join the two databases at the city level, thanks to their unique code.

In [15]:
# merge on the city code
data_college = pd.merge(data_college, cities_data,
                        left_on='Commune et arrondissement code', right_on='insee_code', how='left')

# fill na by taking the average value at the departement level
city_col_with_na = []
for col in cities_data.columns:
    if cities_data[col].isna().sum() > 0:
        city_col_with_na.append(col)

#print(city_col_with_na)
for col in city_col_with_na:
    data_college[col] = data_college[['Département code', col]].groupby('Département code').transform(lambda x: x.fillna(x.mean()))
    
data_college.head()
Out[15]:
Appartenance EP Name Coordonnée X Coordonnée Y Etablissement sensible CATAEU2010 Situation relative à une zone rurale ou autre Commune code City_name Commune et arrondissement code Commune et arrondissement nom Département code Département nom Académie code Académie nom Région code Région nom Région 2016 code Région 2016 nom Nb élèves Nb 6èmes Nb 5èmes Nb 4èmes générales Nb 3èmes générales Nb 6ème SEGPA Nb 5ème SEGPA Nb 4ème SEGPA Nb 3ème SEGPA Nb SEGPA Nb 3èmes générales retardataires Nb divisions Nb 6èmes provenant d'une école EP Nb 5èmes 4èmes et 3èmes générales Latin ou Grec Nb 5èmes 4èmes et 3èmes générales Nb élèves pratiquant langue rare Nb 6èmes bilangues Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales Nb 6èmes 5èmes 4èmes et 3èmes générales Nb 3émes générales et insertion rentrée précédente passés en 2nde GT Nb 3émes générales et insertion rentrée précédente passés en cycle professionnel Longitude Latitude Position target insee_code LIBGEO REG DEP population SUPERF med_std_living poverty_rate unemployment_rate
0 HEP jules simon 267972.0 6744464.8 NON 111.0 urbain 56260 vannes 56260 VANNES 56 MORBIHAN 14.0 RENNES 53 BRETAGNE 53 BRETAGNE 774.0 175.0 185.0 198.0 216.0 0.0 0.0 0.0 0.0 0.0 37.0 30.0 43.0 32.0 0.0 774.0 147.0 38.0 214.0 -2.760887 47.658668 -2.760887 47.658668 47.6586681136,-2.76088651324 95.6 56260 Vannes 53.0 56 53200.0 32.30 21046.666667 16.432284 0.172886
1 HEP raoul blanchard 942287.5 6538643.7 NON 111.0 urbain 74010 annecy 74010 ANNECY 74 HAUTE SAVOIE 8.0 GRENOBLE 82 RHONE-ALPES 84 AUVERGNE-ET-RHONE-ALPES 829.0 211.0 200.0 192.0 213.0 0.0 0.0 0.0 0.0 0.0 51.0 31.0 144.0 0.0 0.0 816.0 150.0 41.0 211.0 6.125997 45.904308 6.125997 45.904308 45.9043076805,6.12599674762 88.9 74010 Annecy 84.0 74 125694.0 66.93 22531.250000 12.108427 0.107961
2 HEP georges duhamel 637692.1 6878029.4 NON 111.0 urbain 95306 herblay 95306 HERBLAY 95 VAL-D'OISE 25.0 VERSAILLES 11 ILE-DE-FRANCE 11 ILE-DE-FRANCE 372.0 102.0 85.0 84.0 85.0 0.0 0.0 0.0 0.0 0.0 20.0 14.0 0.0 12.0 0.0 356.0 57.0 20.0 84.0 2.148468 48.999161 2.148468 48.999161 48.9991613692,2.14846829582 76.3 95306 Herblay 11.0 95 28341.0 12.74 25128.857143 9.647151 0.096764
3 HEP edouard herriot 585352.4 6816567.4 NON 111.0 urbain 28218 luce 28218 LUCE 28 EURE-ET-LOIR 18.0 ORLEANS-TOURS 24 CENTRE-VAL-DE-LOIRE 24 CENTRE-VAL-DE-LOIRE 517.0 124.0 131.0 133.0 116.0 0.0 0.0 0.0 0.0 0.0 16.0 20.0 0.0 23.0 0.0 504.0 76.0 29.0 119.0 1.449799 48.439256 1.449799 48.439256 48.4392557302,1.44979875281 89.2 28218 Lucé 24.0 28 16107.0 6.06 17900.333333 21.074758 0.200058
4 HEP henri dheurle 372116.5 6400997.6 NON 111.0 urbain 33529 la teste de buch 33529 LA TESTE-DE-BUCH 33 GIRONDE 4.0 BORDEAUX 72 AQUITAINE 75 NOUVELLE-AQUITAINE 777.0 194.0 194.0 174.0 201.0 0.0 0.0 0.0 0.0 0.0 24.0 27.0 1.0 29.0 0.0 763.0 111.0 40.0 189.0 -1.135637 44.630780 -1.135637 44.630780 44.6307801744,-1.13563658848 87.4 33529 La Teste-de-Buch 75.0 33 26110.0 180.20 21434.666667 10.666966 0.152706
In [16]:
plt.figure(figsize=(10,7))
plt.scatter(data_college['unemployment_rate'], data_college['target'])
plt.xlabel("Average unemployment rate in the departement")
plt.ylabel("Success rate (%)")
plt.title("Success rate according to the unemployment rate in the departement", fontsize=14)
plt.ylim(50,100)
plt.show()

At first glance, it would appear that the unemployment rate in the city has a negative effect on the exam pass rate. This supports the idea that socio-economic variables in the environment in which the school is located may be crucial explanatory factors.

Geographic visualization

Important note: to render the dynamic maps correctly, please run this notebook with an active internet connection (the maps fetch their tiles online).

Let's look at the distribution of the success rate by department.

In [17]:
# Create a dynamic map
def plot_per_department(column, data_college,
                        cmap='OrRd_r',
                        path_geo_data = './data/donnees_geographiques/fichiers_geopandas/'):
    '''
    Functions which returns an interactive map of France colored by department according
    to the value of the feature "column" (for example the target).
    
    Parameters:
        column (string): the name of the column to plot per department
        data_college: the data frame with the data per college
        path_geo_data: path to a geojson file of france
    '''
    
    dep_df = data_college.groupby('Département code').agg({column:'mean'}).reset_index()
    dep_df.columns = ['code', column] # we rename the department code because it's the key of the geojson file
    
    
    m = folium.Map(location=[46.45,1], zoom_start=6) #map centered on France

    # Add the choropleth
    m.choropleth(
       geo_data=path_geo_data+'departements.geojson.txt', # geoJSON file or url to geojson
       name='choropleth',
       data=dep_df, # Pandas dataframe
       columns=['code',column], # key and value of interest from the dataframe
       key_on='feature.properties.code', # key to link the json file and the dataframe
       fill_color=cmap, # colormap
       fill_opacity=0.7,
       line_opacity=0.2,
       legend_name=column
    )
    
    display(m)
    
    return None
In [18]:
plot_per_department(column='target', data_college=data_college)

Now we can look in detail, at the city level.

Thanks to the function below, you can choose a department and display an interactive map of its collèges. You can click on the small circles (each corresponds to a collège) to see some information about it:

  • its name
  • its success rate
  • its city and department
  • Its 'Appartenance EP':
    • 'HEP': not a priority education collège
    • 'REP': a priority education collège
    • 'REPPLUS': the highest priority level
In [19]:
# Create a dynamic map for the cities in each department
def plot_cities_in_dep(column, dep_code, dep_name,
                       data_college,
                       url_parent='https://france-geojson.gregoiredavid.fr/repo/departements/',
                       cmap='OrRd_r'):
    '''
    column (string): name of the column to plot
    dep_code(string): the code of the department to plot 
    dep_name (string): the name of the department
    data_college : the data frame with data per college
    url_parent : the url of France GEOJSON
    '''

    cities_df = data_college[data_college['Département code']==dep_code]
    cities_df_group = cities_df.groupby('Commune et arrondissement code').agg({column:'mean'}).reset_index()
    cities_df_group.columns = ['code', column]
    
    #url to geojson of the department
    url = url_parent+dep_code+'-'+dep_name+'/communes-'+dep_code+'-'+dep_name+'.geojson'
    coords = gpd.read_file(url).loc[0].geometry.centroid.coords[0] # coordinates where the map is centered
    
    m = folium.Map(location=[coords[1],coords[0]], zoom_start=10) # map centered on the department
    
    # Add the choropleth
    m.choropleth(
       geo_data=url,# url to geojson of the department
       name='choropleth',
       data=cities_df_group, # Pandas dataframe
       columns=['code',column], # key and value of interest from the dataframe
       key_on='feature.properties.code', # key to link the json file and the dataframe
       fill_color=cmap, # colormap
       fill_opacity=0.7,
       line_opacity=0.2,
       legend_name=column
    )
    
    #add the colleges marker 
    for ix, row in cities_df.iterrows():
        # Create a popup tab with the college name and its success_rate
        popup_df = pd.DataFrame(data=[['College', row['Name']], 
                                      ['Sucess rate', str(row['target'])+'%'], 
                                      ['City', row['Commune et arrondissement nom']],
                                      ['Department', row['Département nom']],
                                      ['Appartenance EP', row['Appartenance EP']]])
        popup_html = popup_df.to_html(classes='table table-striped table-hover table-condensed table-responsive', index=False, header=False)
        # Create a marker on the map
        folium.CircleMarker(location = [row['Latitude'],row['Longitude']], radius=2, popup=folium.Popup(popup_html), color='red', alpha=0.5, fill_color='#0000FF').add_to(m)

    display(m)
    
    return None
In [20]:
plot_cities_in_dep(column='target',
                   dep_code='75',
                   dep_name='paris',
                   data_college=data_college)
  • Unsurprisingly, some of the richest arrondissements, like the $5^{th}$, the $7^{th}$ or the $8^{th}$, have the highest average success rates (more than 94%)
  • Surprisingly, the $10^{th}$ arrondissement has an average success rate (76-80%) below the national average (87.4%). This is because the collège La Grange aux Belles, which is in REP, has a very low success rate of 67%, and there are only 4 collèges in the arrondissement. One must therefore be careful when interpreting these rates in some cities.
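The caveat above can be made systematic: reporting the number of collèges alongside each city (or arrondissement) average makes small-sample cases easy to spot. A minimal sketch on hypothetical success rates (not taken from the dataset):

```python
import pandas as pd

# Hypothetical data: success rates for a handful of collèges in two cities
df = pd.DataFrame({
    'city': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'target': [94, 95, 93, 96, 88, 67, 90, 85, 91, 86],
})

# Report the count alongside the mean so small-sample cities stand out
summary = df.groupby('city')['target'].agg(['mean', 'count'])
print(summary)
```

A city average built from only a handful of collèges (like city 'A' here, or the $10^{th}$ arrondissement above) should be read with caution.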

Macro socio-economic features

Some of the following features' possible link with the target are worth investigating:

  • the median standard of living
  • the poverty rate
  • the unemployment rate

The median standard of living.

  • The national average is approximately 20,000 euros: on average, half of the households of a French city live on less than 20,000 euros per year
In [21]:
print('Mean median living wage: %.5f' %cities_data.med_std_living.mean())
Mean median living wage: 20633.28419
In [22]:
plot_per_department(column='med_std_living', data_college=data_college, cmap='Blues_r')
In [23]:
plot_cities_in_dep(column='med_std_living', dep_code='75', dep_name='paris',
                   data_college=data_college, cmap='Blues_r')
In [24]:
plot_cities_in_dep(column='med_std_living', dep_code='93', dep_name='seine-saint-denis',
                   data_college=data_college, cmap='Blues_r')
  • As expected, clicking on some collèges in the dark blue zones, which correspond to cities where the median standard of living is very low, shows that their success rates are poor.
  • For example, the collège Jean Lurcat in Saint-Denis has a success rate of 72.7%

The unemployment rate

  • National average (per city) in 2015: 11%
In [25]:
print('Average unemployment rate: %.2f%%' %(cities_data.unemployment_rate.mean()*100))
Average unemployment rate: 11.05%
In [26]:
plot_per_department(column='unemployment_rate', data_college=data_college, cmap='Purples')
In [27]:
plot_cities_in_dep(column='unemployment_rate', dep_code='75', dep_name='paris',
                   data_college=data_college, cmap='Purples')
  • In the $19^{th}$ arrondissement, the unemployment rate is higher and the collèges have a success rate under the national average (87.3%)

The poverty rate

  • Definition: the percentage of people who live on less than 60% of the median standard of living.
  • National average per city: 13.9%
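The definition can be checked on a toy example (synthetic incomes, not from our dataset): the poverty rate is the share of people below 60% of the median income.

```python
import numpy as np

# Synthetic yearly incomes (euros); illustrative values only
incomes = np.array([8_000, 12_000, 15_000, 20_000, 22_000, 25_000, 30_000, 60_000])

median = np.median(incomes)              # (20000 + 22000) / 2 = 21000
threshold = 0.6 * median                 # 60% of the median
poverty_rate = np.mean(incomes < threshold) * 100

print(f"Poverty threshold: {threshold:.0f} euros")
print(f"Poverty rate: {poverty_rate:.1f}%")
```

Here two of the eight incomes fall below the threshold, giving a poverty rate of 25%.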
In [28]:
print('Average poverty rate: %.02f%%' %cities_data.poverty_rate.mean())
Average poverty rate: 13.85%
In [29]:
plot_per_department('poverty_rate', data_college, cmap='Greens')
In [30]:
plot_cities_in_dep('poverty_rate', dep_code='62', 
                   dep_name='pas-de-calais', data_college=data_college, cmap="Greens")
  • In the Pas-de-Calais department, where many cities have a high poverty rate, it is not clear whether the poverty rate of a city has a direct impact on the success rate of its collèges.
In [31]:
fig, ax = plt.subplots(1,3, figsize=(20,5))
sns.scatterplot(x='med_std_living', y='target', data=data_college, ax=ax[0])
sns.scatterplot(x='unemployment_rate', y='target', color='purple',
                data=data_college.dropna(subset=['unemployment_rate']), ax=ax[1])
sns.scatterplot(x='poverty_rate', y='target', color='green',
                data=data_college.dropna(subset=['poverty_rate']), ax=ax[2])
plt.show()
  • The median standard of living and the poverty rate seem to be discriminative
  • The influence of the unemployment rate is less clear

Priority Education Network and "sensitive" schools

The priority education policy aims to reduce the achievement gaps between pupils enrolled in priority education and those who are not. Two types of networks exist: the REP+, which cover neighbourhoods or isolated sectors with the greatest concentration of social difficulties, with a strong impact on educational success, and the REP, which are more socially mixed but still face greater social difficulties than schools outside priority education. Not every collège belongs to a priority education network; those that do not are labelled "HEP" (hors éducation prioritaire, outside the priority education network).
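Since "Appartenance EP" takes the three labels above (HEP, REP, REP+), it is a categorical feature, and it is one-hot encoded later in the pipeline. A minimal sketch of that encoding, assuming only these three label values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The three network labels described in the text
labels = np.array([['HEP'], ['REP'], ['REP+'], ['HEP']])

enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(labels).toarray()

print(enc.categories_[0])  # ['HEP' 'REP' 'REP+']
print(encoded)             # one column per label, a single 1 per row
```

`handle_unknown='ignore'` encodes an unseen label as an all-zero row instead of raising, which is why it is also used in the pipeline below.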

In [32]:
ax = sns.boxplot(x="Appartenance EP", y="target", data=data_college)
ax.axes.set_title("Success rate according to the type of education network")
ax.set_xlabel("Type of education network")
ax.set_ylabel("Success rate");

It can easily be observed that the success rate is significantly lower when the collège is part of a priority education network (this is all the more the case for the REP+ network). However, no causal link can be drawn from this, since collèges are placed in a priority education system precisely when they have a certain amount of ground to make up in relation to other collèges.

Alongside the priority education networks, there exists another label for collèges in difficulty, the établissements sensibles. These so-called "sensitive" schools are secondary schools in which a climate of insecurity prevails that seriously compromises pupils' schooling. They are not necessarily part of priority education.

The development of violence in schools has led the Ministers of National Education and of the Interior to strengthen their collaboration, which has resulted, since 1992, in the classification of certain public secondary schools as "sensitive", without implying, however, that violence is present only in these schools.

Like REPs, sensitive schools benefit from special measures. They receive exceptional support in terms of innovative and adapted pedagogy, through increased teaching hours (class splitting, tutoring, supervised studies, etc.), the allocation of additional posts, a stronger adult presence (more senior educational advisers, boarding-school supervisors, day-school supervisors, etc.) and the appointment of two head teachers per class.

In [33]:
sns.catplot(x="Appartenance EP", y="target", hue="Etablissement sensible", kind="box", data=data_college)
plt.title('Success rate according to the type of education network');

Analysis of class size

Another determinant of academic success is the number of students per class. It is easy to understand that a smaller class size makes it much easier for the teacher to spend more time with each student, thus ensuring that all students progress through the classroom without some being left behind. This is true at all levels of education but mainly during the first years of schooling.

Unfortunately, the variable "number of students per class" is not present in our database. However, it is possible to create an "average number of pupils per class" variable from the variables "total number of pupils in the school" and "number of classes".

In [34]:
data_college['average_class_size'] = data_college['Nb élèves'] / data_college['Nb divisions']
In [35]:
plt.scatter(data_college['average_class_size'], data_college['target'])
plt.xlabel("Average class size")
plt.ylabel("Success rate (%)")
plt.title("Success rate according to the average class size", fontsize=14)
plt.ylim(50,100)
plt.show()

In this case, the effect of class size on the exam pass rate cannot be determined directly because the rest of the variables must be controlled for. However, it is interesting to keep such a variable, given its importance in the literature.

In the same way, it is possible to create other new variables, such as the share of pupils who are in a general stream or the share of pupils who are in a European or international section.

In [36]:
# percentage of pupils in the general stream
data_college['percent_general_stream'] = data_college['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / data_college['Nb élèves']
# percentage of pupils in an european or international section
data_college['percent_euro_int_section'] = data_college['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / data_college['Nb élèves']
# percentage of pupils doing Latin or Greek
sum_global_5_to_3 = data_college['Nb 5èmes'] + data_college['Nb 4èmes générales'] + data_college['Nb 3èmes générales']
data_college['percent_latin_greek'] = data_college['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
# percentage of pupils that are in a SEGPA class
data_college['percent_segpa'] = data_college['Nb SEGPA'] / data_college['Nb élèves']

Analysis of the quantitative features

In [37]:
quant_features = ['Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
                  'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb élèves pratiquant langue rare',
                  'Nb 3ème SEGPA',
                  'average_class_size', 'percent_general_stream', 'percent_euro_int_section']
data_college[quant_features].describe()
Out[37]:
Nb élèves Nb 3èmes générales Nb 3èmes générales retardataires Nb 5èmes 4èmes et 3èmes générales Latin ou Grec Nb élèves pratiquant langue rare Nb 3ème SEGPA average_class_size percent_general_stream percent_euro_int_section
count 4186.000000 4186.000000 4186.000000 4186.000000 4186.000000 4186.000000 4186.000000 4186.000000 4186.000000
mean 516.225036 119.627568 18.273053 17.682991 494.123268 4.247731 24.452214 0.238136 0.056751
std 163.357700 40.049045 11.141567 25.031394 156.442065 7.434012 2.502211 0.033351 0.022781
min 120.000000 0.000000 0.000000 0.000000 111.000000 0.000000 12.866667 0.101480 0.000000
25% 397.000000 91.000000 11.000000 0.000000 378.000000 0.000000 22.736656 0.216216 0.040541
50% 504.000000 116.000000 16.000000 13.000000 482.000000 0.000000 24.666667 0.237145 0.054127
75% 623.750000 145.000000 24.000000 25.000000 599.000000 9.000000 26.380952 0.257856 0.069565
max 1711.000000 419.000000 211.000000 211.000000 1698.000000 35.000000 32.095238 0.591667 0.175000
In [38]:
data_college[quant_features].hist(figsize=(16, 20), bins = 50, xlabelsize=8, ylabelsize=8)
plt.show()

We can observe that most of our quantitative features have a roughly Gaussian distribution.
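One way to go beyond visual inspection is the sample skewness, which is close to 0 for a symmetric, Gaussian-like distribution and large and positive for right-skewed data. A sketch on synthetic data (the column names are illustrative only):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'gaussian_like': rng.normal(loc=24.5, scale=2.5, size=5000),  # e.g. class sizes
    'right_skewed': rng.exponential(scale=18.0, size=5000),       # e.g. Latin/Greek counts
})

# Sample skewness: ~0 for symmetric data, large and positive for right-skewed data
print(df.skew())
```

Applying `data_college[quant_features].skew()` in the same way would quantify which features deviate from the Gaussian shape seen in the histograms.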

Correlation study

Finally, we can have a look at the correlations between our quantitative features and the target.

In [39]:
features_for_corr = ['target', 'average_class_size',
                     'percent_general_stream', 'percent_euro_int_section',
                     'med_std_living', 'poverty_rate', 'unemployment_rate']
sns.heatmap(data_college[features_for_corr].corr(), cmap='YlGn')
plt.title('Analysis of correlations via heatmap');
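Reading exact values off a heatmap is hard; a complementary idiom is to rank the features by their correlation with the target. A sketch on synthetic data (the column names mirror the dataset, but the values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 500
med_std_living = rng.normal(20_000, 3_000, n)
df = pd.DataFrame({
    'med_std_living': med_std_living,
    'poverty_rate': rng.uniform(5, 30, n),
    'target': 60 + med_std_living / 1_000 + rng.normal(0, 5, n),  # built to correlate with income
})

# Rank features by their (signed) correlation with the target
corr_with_target = df.corr()['target'].drop('target').sort_values(ascending=False)
print(corr_with_target)
```

The same one-liner applied to `data_college[features_for_corr]` would give a ranked, numeric version of the heatmap above.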

Workflow

The workflow is composed of two elements that make up the submission: the feature extractor and the regressor. The former both prepares the initial data and creates new variables. The latter trains a supervised learning model so that the success rate at the exam can be predicted. This model is trained on part of the dataset produced by the feature extractor, then evaluated on the remaining part.

We will use a Random Forest regressor to predict the success rates.

In [40]:
data_college = pd.read_csv('./data/college/data_college_filtered.csv', index_col=0)
y_array = data_college['target'].values
X_df = data_college.drop('target', axis=1)

cities_data = pd.read_csv("./data/donnees_geographiques/cities_data_filtered.csv", index_col=0)
keep_col_cities = ['population', 'SUPERF', 'med_std_living', 'poverty_rate', 'unemployment_rate']
In [41]:
def process_students(X):
    """Create new features linked to the pupils"""
    # average class size
    X['average_class_size'] = X['Nb élèves'] / X['Nb divisions']
    # percentage of pupils in the general stream
    X['percent_general_stream'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / X['Nb élèves']
    # percentage of pupils in an european or international section
    X['percent_euro_int_section'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / X['Nb élèves']
    # percentage of pupils doing Latin or Greek
    sum_global_5_to_3 = X['Nb 5èmes'] + X['Nb 4èmes générales'] + X['Nb 3èmes générales']
    X['percent_latin_greek'] = X['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
    # percentage of pupils that are in a SEGPA class
    X['percent_segpa'] = X['Nb SEGPA'] / X['Nb élèves']

    return np.c_[X['average_class_size'].values,
                 X['percent_general_stream'].values,
                 X['percent_euro_int_section'].values,
                 X['percent_latin_greek'].values,
                 X['percent_segpa'].values]
        
        
def merge_naive(X):
    # merge the two databases at the city level
    df = pd.merge(X, cities_data,
                  left_on='Commune et arrondissement code', right_on='insee_code', how='left')

    # fill na by taking the average value at the departement level
    for col in keep_col_cities:
        if cities_data[col].isna().sum() > 0:
            df[col] = df[['Département code', col]].groupby('Département code').transform(lambda x: x.fillna(x.mean()))

    return df[keep_col_cities]
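The department-level imputation used in merge_naive can be illustrated on a toy frame (synthetic values): each missing city value is replaced by the mean of its department.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Département code': ['75', '75', '93', '93', '93'],
    'poverty_rate': [10.0, np.nan, 25.0, 35.0, np.nan],
})

# Fill each missing value with the mean of its department
df['poverty_rate'] = (df.groupby('Département code')['poverty_rate']
                        .transform(lambda x: x.fillna(x.mean())))
print(df)
```

The missing value in department 75 becomes 10.0 (the only observed value there), and the one in department 93 becomes 30.0 (the mean of 25 and 35).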
In [42]:
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer


# Transformers
students_col = ['Nb élèves', 'Nb divisions', 'Nb 6èmes 5èmes 4èmes et 3èmes générales',
                'Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales',
                'Nb 5èmes', 'Nb 4èmes générales', 'Nb 3èmes générales',
                'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb SEGPA']
students_transformer = FunctionTransformer(process_students, validate=False)

num_cols = ['Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
            "Nb 6èmes provenant d'une école EP"]
numeric_transformer = Pipeline(steps=[('scale', StandardScaler())])

cat_cols = ['Appartenance EP', 'Etablissement sensible', 'CATAEU2010',
            'Situation relative à une zone rurale ou autre']
categorical_transformer = Pipeline(steps=[('encode', OneHotEncoder(handle_unknown='ignore'))])

merge_col = ['Commune et arrondissement code', 'Département code']
merge_transformer = FunctionTransformer(merge_naive, validate=False)

drop_cols = ['Name', 'Coordonnée X', 'Coordonnée Y', 'Commune code', 'City_name',
             'Commune et arrondissement code', 'Commune et arrondissement nom',
             'Département nom', 'Académie nom', 'Région nom', 'Région 2016 nom',
             'Longitude', 'Latitude', 'Position']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols),
        ('students', make_pipeline(students_transformer, SimpleImputer(strategy='mean'), StandardScaler()), students_col),
        ('merge', make_pipeline(merge_transformer, SimpleImputer(strategy='mean')), merge_col),
        ('drop cols', 'drop', drop_cols),
        ], remainder='drop') # remainder='drop' or 'passthrough'
In [43]:
# check it works
preprocessor.fit_transform(X_df)
Out[43]:
array([[ 1.57816718e+00,  2.40664777e+00,  1.68101893e+00, ...,
         2.10466667e+04,  1.64322843e+01,  1.72886328e-01],
       [ 1.91489186e+00,  2.33173067e+00,  2.93772485e+00, ...,
         2.25312500e+04,  1.21084267e+01,  1.07960805e-01],
       [-8.82984188e-01, -8.64732351e-01,  1.55018891e-01, ...,
         2.51288571e+04,  9.64715116e+00,  9.67644240e-02],
       ...,
       [-6.87226673e-02,  8.42176087e-02, -9.22157608e-01, ...,
         2.26595000e+04,  1.08124686e+01,  1.17145873e-01],
       [ 4.51670034e-01,  5.83664956e-01, -2.04039942e-01, ...,
         1.90816667e+04,  1.95091294e+01,  1.80443702e-01],
       [-1.00542953e+00, -1.31423496e+00, -1.01192232e+00, ...,
         2.13980000e+04,  9.49384985e+00,  1.34922608e-01]])
In [44]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=5, max_depth=50, max_features=10)
In [45]:
from sklearn.metrics import make_scorer, mean_squared_error

def normalized_rmse(y_true, y_pred):
    """Normalized RMSE"""
    if isinstance(y_true, pd.Series):
        y_true = y_true.values

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return rmse / np.std(y_true) 
    
custom_loss = make_scorer(normalized_rmse, greater_is_better=False)
In [46]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

clf = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('regressor', regressor)])

cv = ShuffleSplit(n_splits=5, test_size=0.25)

scores_Xdf = -cross_val_score(clf, X_df, y_array, cv=cv, scoring=custom_loss)

print("mean: %.2e (+/- %.2e)" % (scores_Xdf.mean(), scores_Xdf.std()))
mean: 9.41e-01 (+/- 1.83e-02)
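A normalized RMSE close to 1 is a weak result: by construction, a constant predictor that always outputs the mean of y_true scores exactly 1.0 under this metric, so the model above barely improves on that baseline. A quick check (the metric re-stated standalone for the sketch):

```python
import numpy as np

def normalized_rmse(y_true, y_pred):
    """RMSE divided by the standard deviation of the true values."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.std(y_true)

rng = np.random.RandomState(0)
y = rng.normal(87, 8, size=1000)       # synthetic success rates

baseline = np.full_like(y, y.mean())   # always predict the mean
print(normalized_rmse(y, baseline))    # ≈ 1.0 by construction
```

Scores below 1 therefore measure how much better than the mean-predictor the model is.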

Submission

In order to run a whole submission, we need to split the feature extractor and the regressor into two separate Python files. Basically, we create a subfolder inside the folder 'submissions', name it 'starting_kit' for example, and place in this subfolder the two files feature_extractor.py and regressor.py. These two files simply contain the code presented above, in a more structured way, as presented later.

Note: the metric used to evaluate our model is not defined in either of those two files but is defined in a more general file, problem.py, placed at the root of the project.

In [47]:
# feature_extractor.py

import os
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline 

class FeatureExtractor(object):
    def __init__(self):
        
        self.path = os.path.dirname(__file__)
        
        # read the database with the city informations
        self.cities_data = pd.read_csv(os.path.join(self.path, 'cities_data_filtered.csv'), index_col=0)
        self.keep_col_cities = ['population', 'SUPERF', 'med_std_living', 'poverty_rate', 'unemployment_rate']
        
        # Transformers
        
        self.students_col = ['Nb élèves', 'Nb divisions', 'Nb 6èmes 5èmes 4èmes et 3èmes générales',
                             'Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales',
                             'Nb 5èmes', 'Nb 4èmes générales', 'Nb 3èmes générales',
                             'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb SEGPA']
        self.students_transformer = FunctionTransformer(self.process_students, validate=False)
        
        self.num_cols = ['Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
                         "Nb 6èmes provenant d'une école EP"]
        self.numeric_transformer = Pipeline(steps=[('scale', StandardScaler())])
        
        self.cat_cols = ['Appartenance EP', 'Etablissement sensible', 'CATAEU2010',
                         'Situation relative à une zone rurale ou autre']
        self.categorical_transformer = Pipeline(steps=[('encode', OneHotEncoder(handle_unknown='ignore'))])
        
        self.merge_col = ['Commune et arrondissement code', 'Département code']
        self.merge_transformer = FunctionTransformer(self.merge_naive, validate=False)
        
        self.drop_cols = ['Name', 'Coordonnée X', 'Coordonnée Y', 'Commune code', 'City_name',
                          'Commune et arrondissement code', 'Commune et arrondissement nom',
                          'Département nom', 'Académie nom', 'Région nom', 'Région 2016 nom',
                          'Longitude', 'Latitude', 'Position']

    def fit(self, X_df, y_array):
        X_encoded = X_df
        
        self.preprocessor = ColumnTransformer(
            transformers=[
                ('num', self.numeric_transformer, self.num_cols),
                ('cat', self.categorical_transformer, self.cat_cols),
                ('students', make_pipeline(self.students_transformer, SimpleImputer(strategy='mean'), StandardScaler()), self.students_col),
                ('merge', make_pipeline(self.merge_transformer, SimpleImputer(strategy='mean')), self.merge_col),
                ('drop cols', 'drop', self.drop_cols),
                ], remainder='passthrough') # remainder='drop' or 'passthrough'

        self.preprocessor.fit(X_encoded, y_array)

    def transform(self, X_df):
        X_encoded = X_df
        X_array = self.preprocessor.transform(X_encoded)
        return X_array
    
    @staticmethod
    def process_students(X):
        """Create new features linked to the pupils"""
        # average class size
        X['average_class_size'] = X['Nb élèves'] / X['Nb divisions']
        # percentage of pupils in the general stream
        X['percent_general_stream'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / X['Nb élèves']
        # percentage of pupils in an european or international section
        X['percent_euro_int_section'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / X['Nb élèves']
        # percentage of pupils doing Latin or Greek
        sum_global_5_to_3 = X['Nb 5èmes'] + X['Nb 4èmes générales'] + X['Nb 3èmes générales']
        X['percent_latin_greek'] = X['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
        # percentage of pupils that are in a SEGPA class
        X['percent_segpa'] = X['Nb SEGPA'] / X['Nb élèves']
        
        return np.c_[X['average_class_size'].values,
                     X['percent_general_stream'].values,
                     X['percent_euro_int_section'].values,
                     X['percent_latin_greek'].values,
                     X['percent_segpa'].values]
        
        
    def merge_naive(self, X):
        # merge the two databases at the city level
        df = pd.merge(X, self.cities_data,
                      left_on='Commune et arrondissement code', right_on='insee_code', how='left')

        # fill na by taking the average value at the departement level
        for col in self.keep_col_cities:
            if self.cities_data[col].isna().sum() > 0:
                df[col] = df[['Département code', col]].groupby('Département code').transform(lambda x: x.fillna(x.mean()))
    
        return df[self.keep_col_cities]
        
In [48]:
# regressor.py

from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator


class Regressor(BaseEstimator):
    def __init__(self):
        self.reg = RandomForestRegressor(
            n_estimators=5, max_depth=50, max_features=10)

    def fit(self, X, y):
        self.reg.fit(X, y)

    def predict(self, X):
        return self.reg.predict(X)
In [49]:
!ramp_test_submission
Testing Prediction of succes rates at the Brevet exam
Reading train and test files from ./data ...
Reading cv ...
Training submissions/starting_kit ...
CV fold 0
	score  normalized rmse      time
	train          0.43705  1.028025
	valid          0.95440  0.448320
	test           0.94854  0.411667
CV fold 1
	score  normalized rmse      time
	train          0.45110  0.943786
	valid          0.90906  0.412820
	test           0.94620  0.418762
CV fold 2
	score  normalized rmse      time
	train          0.41144  0.971130
	valid          0.95395  0.476197
	test           0.94920  0.471500
CV fold 3
	score  normalized rmse      time
	train          0.42130  1.039829
	valid          0.92765  0.470854
	test           0.95002  0.415969
CV fold 4
	score  normalized rmse      time
	train          0.42068  0.950585
	valid          0.94567  0.410397
	test           0.94812  0.407459
CV fold 5
	score  normalized rmse      time
	train          0.42068  1.008203
	valid          0.94154  0.415711
	test           0.94138  0.417803
CV fold 6
	score  normalized rmse      time
	train          0.43042  0.934333
	valid          0.98223  0.416615
	test           0.93851  0.413901
CV fold 7
	score  normalized rmse      time
	train          0.43065  0.994172
	valid          0.91149  0.413359
	test           0.93771  0.406948
----------------------------
Mean CV scores
----------------------------
	score     normalized rmse        time
	train  0.42792 ± 0.011481  1.0 ± 0.04
	valid  0.94075 ± 0.022744  0.4 ± 0.03
	test   0.94496 ± 0.004677  0.4 ± 0.02
----------------------------
Bagged scores
----------------------------
	score  normalized rmse
	valid          0.90990
	test           0.86526